15 research outputs found

    Haplotype resolved genomes:Computational challenges and applications

    Get PDF
    Genomes of diploid organisms, like humans, are organized in pairs of chromosomes, one inherited from the father and one from the mother. Each homologous chromosome harbors a specific set of parental alleles, called haplotype. Unfortunately, to obtain haplotype information using current methods remains challenging. Here we introduce a single cell DNA template strand sequencing (Strand-seq) as a novel haplotyping approach able to separate parental alleles along the entire length of all chromosomes. We demonstrate this by building a complete haplotypes for HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. Furthermore we mapped all meiotic recombination events in a family trio with high resolution (median range ~14 kb), and phased larger structural variants like deletions, indels as well as balanced rearrangements like inversions. The single cell resolution of Strand-seq allowed us to observe loss of heterozygosity regions in a small number of cells, a significant advantage for studies of heterogeneous cell populations, such as cancer cells. Lastly, we prove that integration of Strand-seq with other whole-genome sequencing methods brings significant increase in haplotype completeness while reducing sequencing costs. The implementation of Strand-seq and our analysis pipeline brings a powerful, high-throughput approach to assemble haplotypes that will open up new possibilities to study diploid architecture of human genomes in health and disease

    Multi-platform​ ​ discovery​ ​ of​ ​ haplotype-resolved structural​ ​ variation​ ​ in​ ​ human​ ​ genomes

    Get PDF
    The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long- and short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent-child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,181 indel variants (<50 bp) and 31,599 structural variants (≥50 bp) per human genome, a seven fold increase in structural variation compared to previous reports, including from the 1000 Genomes Project. We also discovered 156 inversions per genome, most of which previously escaped detection, as well as large unbalanced chromosomal rearrangements. We provide near-complete, haplotype-resolved structural variation for three genomes that can now be used as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies

    Direct chromosome-length haplotyping by single-cell sequencing

    Get PDF
    Haplotypes are fundamental to fully characterize the diploid genome of an individual, yet methods to directly chart the unique genetic makeup of each parental chromosome are lacking. Here we introduce single-cell DNA template strand sequencing (Strand-seq) as a novel approach to phasing diploid genomes along the entire length of all chromosomes. We demonstrate this by building a complete haplotype for a HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. By use of this approach, we mapped all meiotic recombination events in a family trio with high resolution (median range ∼14 kb) and phased larger structural variants like deletions, indels, and balanced rearrangements like inversions. Lastly, the single-cell resolution of Strand-seq allowed us to observe loss of heterozygosity regions in a small number of cells, a significant advantage for studies of heterogeneous cell populations, such as cancer cells. We conclude that Strand-seq is a unique and powerful approach to completely phase individual genomes and map inheritance patterns in families, while preserving haplotype differences between single cells

    Computational pan-genomics: status, promises and challenges

    Get PDF
    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains
    corecore